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Abstract 

Background: During the last few years, the knowledge of drug, disease phenotype and protein has been rapidly 
accumulated and more and more scientists have been drawn the attention to inferring drug-disease associations 
by computational method. Development of an integrated approach for systematic discovering drug-disease 
associations by those informational data is an important issue. 

Methods: We combine three different networks of drug, genomic and disease phenotype and assign the weights 
to the edges from available experimental data and knowledge. Given a specific disease, we use our network 
propagation approach to infer the drug-disease associations. 

Results: We apply prostate cancer and colorectal cancer as our test data. We use the manually curated drug-disease 
associations from comparative toxicogenomics database to be our benchmark. The ranked results show that our 
proposed method obtains higher specificity and sensitivity and clearly outperforms previous methods. Our result also 
show that our method with off-targets information gets higher performance than that with only primary drug targets 
in both test data. 

Conclusions: We clearly demonstrate the feasibility and benefits of using network-based analyses of chemical, 
genomic and phenotype data to reveal drug-disease associations. The potential associations inferred by our 
method provide new perspectives for toxicogenomics and drug reposition evaluation. 



Background 

Disease an intricate phenotype is usually caused by conge- 
nital disorder or dysfunctions of abnormal genes which 
induce multi-factor-driven alterations and disrupt func- 
tional modules [1]. Drugs achieve their therapeutic effect 
by changing downstream processes of their targets which 
contend with the alterations of those abnormal genes. 
The previous reports also showed that pharmaceutical 
company takes approximately 15 years and over $1 billion 
to develop a novel drug into the market and more than 
90% of experimental drugs fail to move beyond the early 
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clinical test stages [2,3]- Because drug discovery is 
complexity, time-consuming process and there are odds of 
low therapeutic efficacy and/or unacceptable toxicity [4,5]. 
With the merits of shorting development time and 
reducing risk, more and more scientists have been drawn 
attention to inferring drug-disease associations by compu- 
tational method. Development of an integrated approach 
for systematic discovering those associations is necessary. 

Several studies investigated some methods to increase 
the efficacy of drug discovery and they found there are 
positive and negative relationships between existing drugs 
and disease phenotypes. The Comparative Toxicoge- 
nomics Database (CTD; http://ctd.mdibl.org) is the public 
database which inferred chemical-disease associations 
by manually curated chemical-gene interactions, and 
gene-disease relationships from published literature [6]. 
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Cheng also presented a comprehensive predicted database 
of chemical -gene-disease associations (PredCTD) by 
integrating the information from chemical, gene, and 
disease [7]. Pharmacogenetics and pharmacogenomics 
knowledge base (PharmGKB) is a repository which con- 
tains the relationships between genomics, drug-response 
and its related phenotype and clinical information [8]. 
Eichborn developed PROMISCUOUS database which 
includes network-based resources of protein-protein and 
protein-drug interactions, side-effects and structural 
information [9]. 

The high-throughput microarray technology plays an 
important role in investigating drug-disease associations 
by providing a genome-side monitoring of gene expres- 
sion in the past decade. Some methods aims at restoring 
the abnormal state to normal state which means the 
expressions of the transcriptional level induced by drug 
should reverse those under disease state. On the other 
hand, if the differential expression profile under drug 
exposure and disease states is significantly anti-correlated, 
the drug compounds may have the potential to cure that 
disease. The Connectivity Map (Cmap) project is one 
of the most comprehensive and systematic approaches 
for drug-disease associations [10]. The Cmap provided 
a reference collection of genome-wide gene expressions 
profiles among drugs, which were obtained by systemati- 
cally exposing to few key cell lines [11]. Drug compounds 
negatively correlated to disease-specific gene signature 
may be the candidate therapeutic for further investigation. 
On the other hand, drug compounds positively correlated 
to gene signature are able to induce the disease phenotype. 
Li built disease-specific drug-protein associations derived 
from the Cmap by integrating gene/protein and drug con- 
nectivity information based on protein interaction network 
(PIN) and literature mining from PubMed abstracts [12]. 
Previous research used the "guilt by association" (GBA) 
approach, which assumed that when two diseases share 
similar therapies then the drug treats only one of the two 
might be also treat another, to predict novel drug-disease 
associations [13]. With a gold standard set of the drug- 
disease associations, Gottlieb designed a novel computa- 
tional method called PREDICT to identify drug-disease 
associations and also predict new drug indications based 
on their features including chemical structure, side 
effects, gene expression profile, and chemical-protein 
interactome [14]. However, to build an accurate predic- 
tion model based on different feature must have the 
positive and negative data to infer drug-disease associa- 
tions. There are some technically difficulties to obtain 
negative data such as non-drug targets due to the lack 
of value of research. Except learning a classifier to predict 
the associations between drugs and disease, network- 
based approach has been widely used to infer the relation- 
ships. In genetic and molecular biology, increasing 



evidences suggested that common functional modules 
are not affected by an individual gene but usually are 
organized by a group of interacting genes underlie similar 
diseases, which point out the therapeutic importance of 
those modules [15]. Therefore, the other basic hypothesis 
is that the mechanism of the drug and disease in 
the pathological processes may share similar functional 
modules. Daminelli created a drug-target-disease network 
and mined the bi-cliques where every drug is linked to 
every target and disease [16]. If the known data form 
an incomplete bi-clique, the incomplete relations in the 
bi-cliques to be identified as predicted links between drugs 
and diseases. Ye integrated known drug target information 
and proposed a disease-oriented strategy for evaluating the 
relationships between drugs and a specific disease based 
on their pathway profile [17]. 

The huge amount of chemical, genomic and disease phe- 
notype data is rapidly accumulated, but the drug-diseases 
associations are still not clear. For this purpose, we design 
a method of inferring drug- protein/gene-disease pheno- 
type relationships with a network propagation model, 
where genes with similar functional modules are related to 
not only drugs but also the disease phenotype. 

Methods 

We demonstrate the integrated network including three 
heterogeneous networks of the phenotype, drug, protein 
homo-networks and two hetero-networks capture interac- 
tions between two different homo-networks in Figure 1. 
The homo-network is defined as an undirected graph 
Gi = (Vi_ Ei) where Vi is the node set and Ei is the edge set 
in the homo-network i. Connections between two kinds of 
homo-networks define as hetero-network, where the 
nodes from different homo-networks are related to each 
other. The hetero-network is defined as bipartite graph 
Gij = (ViUVj, Eij). Here, Eij represents the set of edges 
which connect the nodes in different kinds of the homo- 
networks / and J. 

Construct phenotype homo-network 

A node in a phenotype homo-network as a disease pheno- 
type is extracted from Online Mendelian Inheritance 
in Man (OMIM) database [18]. We use a scoring schema 
of phenotypic similarities as edges that quantitatively 
measures the phenotypic overlap of OMIM records con- 
structed by van Driel [19] using text mining techniques. 
If the similarity score of two diseases falls in the range 
[0.6, 1], it means informative similarity which indicates 
potentially relevant phenotypic similarity. On the other 
hand, if the similarity score falls within [0, 0.3], it indicates 
non-informative similarity. Therefore, we apply a logistic 
function from [20] to convert the phenotypic similarity 
scores among diseases into a value as close to 1 as possible 
while the non-informative score into a value as close 
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Figure 1 The idea of our proposed metliod. 



to 0 possible over all the entries in the phenotype similar- 
ity score matrix. The symmetric similarity matrix Wp(pi,pj) 
in phenotype homo-network denotes the phenotypic simi- 
larity score between phenotypes pi and Pj. 

Construct drug homo-network using chemical similarity 

We extract the FDA-approved drugs and their canonical 
simplified molecular input line entry specification 
(SMILES) from DrugBank database [21] [22]. We calculate 
the hashed fingerprints using Chemical Development 
Kit (CDK) [23]. The chemical similarities are calculated by 
two hashed fingerprints using Tanimoto coefficient [24]. 
It calculates the size of the common substructures over 
the union between two fingerprints of the drugs which is 
defined as sim(x, x')=\x(\x'\l\x(ix'\ between two chemical 
structures of drug x and x'. The symmetric chemical struc- 
tures similarity matrix as the edge weight in drug homo- 
network is denoted as and each value falls in the range 
between zero (no bits in common) to unity (all bits the 
same) 

Construct protein interaction homo-network using gene 
expression data 

Protein-protein interactions provided an opportunity on 
the discovery of relationships among proteins in the 
mechanism of the drugs and human disease phenotype. 
However, the PIN has its disadvantages: first, the informa- 
tion in current PIN databases is partly complementary and 
the combination of the multiple databases could improve 
the knowledge of the protein interactions [25]. Second, 
PIN in those databases only provides the functional 



relations among the products of the genes and do not 
provide information about the conditions under which the 
interactions occur. Genome-wide expression profile can 
help us to extract the changes of the genes which are 
involved in the activity of a given disease or drug com- 
pound by comparing multiple case-control data sets. The 
co-expressed genes can reveal specific linkages which are 
more likely to function together, and we apply Pearson 
correlation coefficient for every pair-wise relation among 
genes. The positive correlation indicates an increasing 
linear relationship and negative correlation indicates 
a decreasing linear relationship. On the other hand, the 
correlation approaches to zero shows there would be little 
or no association. By visualizing gene expression over- 
expressed and down-expressed functional modules, we 
take the absolute value of correlation value to capture 
inhibitory activity (negative correlation) as well as activa- 
tion activity (positive correlation). We define the weight 
function as the product of the absolute value of correlation 
and the sum of the absolute value of differential expression 
changes between two corresponding genes in control and 
case samples. The symmetric weighted matrix between all 
interactions among all gene pairs is denoted as Wg and the 
higher weight denotes the stronger correlation or larger 
differential expression exchanges. 

where W^gj^gj) denotes the weight function from gene gi 
to gene gj. W^^^gA denotes the absolute value of Pearson 
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correlation coefficient of the interaction between gene gi 
and gj from case data. E^, and E'^. are the average gene 
expression values of gene gi in case sample and control 
sample. 

Integrated disease, protein interactions and chemical 
homo-networks 

The gene-phenotype hetero-network shows the relation- 
ships between disease phenotype and disease-associated 
genes extracted from OMIM database [18]. The drug- 
protein hetero-network denotes the drug and its targets 
which is obtained from DrugBank database [21]. The 
asymmetric matrices Wpg, W^g represent the adjacency 
matrices of link structures from phenotype-gene relation- 
ships and drug-target protein interactions, respectively. 
If drug di has a target gj, then W^gfdi, gj) = 1, otherwise 
Wag(dj, gj) = 0. When a drug target or disease-associated 
gene has no link with other proteins in PIN, we set the 
probability of connection to any other protein as 
where n is the total number of proteins in PIN. Since n is 
usually very large, so the probability will be very small. 
The reason that we use small probability instead of zero 
probability is to prevent a node in the network becoming 
a "sink node" in PIN and allows the probability to be 
propagated through the node. 

Networl< propagation in the Integrated networic 

We identify the inferring drug-disease associations 
problem as probability propagation over a network 
which simulates a random walker stochastically move 
on query phenotype to its immediate neighbors in het- 
erogeneous network [26,27]. We adopt the idea from 
[28] which developed a label propagation algorithm for 
an integrated network. 

In order to balancing the influence on the nodes which 
they are coimected in the network, we normalize all simi- 
larity matrices W to be a transition matrix S by a diagonal 
matrix Dr(i,i) which indicates the sum of row i while 
another diagonal matrix D/j,;) indicates the sum of column 
/ in the similarity matrix, respectively. If the similarity 
matrix of a homo-network is symmetric, the diagonal 
matrix D/i,i) is equal to D/),j). 



S = W (/, j) /^Dr a, i)D,{j,j) 



where i denotes the node in a homo-network and / 
denotes the node in the other homo-network. 

Given transition matrix S, diffusion parameter a and the 
vector with the initial probability distribution over 
nodes, the probability transition process in network 
propagation method on single network within t steps is 
defined as: 



The first term denotes the random walker can "restarts" 
to the initial probability distribution among the nodes 
with the diffusion parameter 1-a. The second term 
denotes an iterative walk to reach the further nodes in the 
network based on the transition matrix S and a diffusion 
parameter a. The random walker will be trapped at initial 
nodes if a is zero. Let P' be a probability distribution 
where a node in the network holds the probability of 
finding itself in the iterative random walker process up to 
the step t. After certain steps, the probabilities will reach 
a steady state which the difference between and P*'^ 
measured by L2 norm falls below a very small value such 
as 10'^. 

We extend the network propagation on a single network 
to our integrated network. The nodes receive the probabil- 
ities from other nodes in the same homo-network, and 
also can get the probabilities from nodes in other homo- 
networks through hetero-networks [29]. Therefore, the 
initial probability would be replaced by adding the addi- 
tional information from its immediate neighbors through 
hetero-networks [30]. The new initial probability vector 
in each homogeneous network i is proposed by 
a weighted sum which is formulated as follows: 



p° = aip° + b, 



Where «, and bij denote the weights in homo-network i 
and those between two homo-networks i and /, respec- 
tively. Sij denotes the normalized transition matrix; of the 
hetero-network. Then, we should keep the probability of 
the node be equal to one and we must further ensure the 
parameters aj and bij to satisfy: 



If homo-network i connect to other k homo-networks, 
we adopt the weight of bij the same as diffusion parameter 
a, to the immediate neighbors in the other homo- 
networks. We calculate + to, = 1 and obtain the weight 
= 1 - kUj. Finally, we further elaborate the network 
propagation method on a homo-network i into: 



Pi = (l-«0 



+ aiSip\ 



il-kai)p9+aiJ2SiiPj 



p^ = (l- a) p° + aSp' 



,t-i 



Thus, network propagation is calculated with an 
enriched initialization from the other homo-networks 
through hetero-subnetworks and the proof of convergence 
is in [31]. 

Given a query phenotype, we first set the initial prob- 
ability distribution over nodes where the probabilities to 
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the query disease nodes set to one and other nodes in the 
other homo-networks to zero. Second, we apply our net- 
work propagation method iteratively until the probability 
converges on each homo-network. Finally, we use the 
coverged probabilities of the nodes in homo-networks as 
Initail probability distribution and then repeated the 
network propagation processes until all homo-networks 
converge to a final probability distribution. 

Evaluation of association specificity between drug and 
disease 

In our method, we apply chemical similarity, gene expres- 
sion, and phenotype similarity data and the transition in 
the network propagation processes may skew the visitation 
frequencies towards those supplied data values. Since the 
frequencies of node visitations may also be highly biased 
by the linkages in the network topology, the probabilities 
of the nodes may directly reflect the relative centralities in 
the network based on the local network connectivity [32]. 
In order to control the topological biases in PIN, we calcu- 
late the reference visitation frequencies without taking the 
gene expression profile of the specific disease phenotype 
into consideration. Therefore, we set all the edge weights 
among genes to one in PIN and get the reference probabil- 
ities distribution P/"-^ over nodes in each homo-network i 
using our method. Then, we evaluate the specificity of the 
probability of a node using Z-score as the final normalized 
score distribution which reflects the relevance nodes 
related to the query phenotype. 



Pi{v) 



avg{ff) 



std 



Here, Pi(v) denote the probability of node v in homo- 
network i using genomic data calculated by our method. 
The functions avg and std denote the average and stan- 
dard deviation for the set of reference probabilities Pf'^-^ in 
homo-network i respectively. 

Results 

Gene expression profile 

We adopt microarray data taken from [33] that consists 
of 62 primary prostate tumors and 41 normal tissues 
from Stanford Microarray Database (SMD) [34]. We use 
genome-wide gene expression profiles from tissue samples 
of 18 healthy normal controls and 27 patients with color- 
ectal cancer evaluated by HG-U133 Plus 2.0 platform 



microarrays (Affymetrix, Santa Clara) through from Gene 
Expression Omnibus (GEO) database (GSE4183 and 
GSE4107) [35,36]. 

Protein interaction network 

We successfully obtained 137,037 interactions among 
13,388 genes by integrating five protein interaction 
network databases (HPRD, BIND, IntAct, MINT, and 
OPHID) and by mapping the UniProt protein ID to the 
human Entrez gene ID, erase the duplicated interaction 
pairs. 

Phenotype network and Phenotype-genotype 
hetero-network 

The OMIM database constructed the catalogue of genetic 
diseases in human and provides the phenotype-genotype 
association for 14,433 genes and 5,080 diseases [18]. 
The gene-phenotype hetero-network contains 275 disease 
phenotypes and 649 genes from 877 relations while 
mapping the genes in microarray data and PIN. 

Drug network and drug-target hetero-network 

We collect 1,571 FDA-approved drugs and 1,410 of them 
with available SMILES data in DrugBank database [21]. 
There are 4,456 relations between 1,215 drugs and 1,141 
targets to be the drug-target hetero-network. 

Benchmark of drug-disease associations 

We extract 53 and 106 known associations between drug, 

prostate cancer and colorectal cancer extracted from CTD 
database [6] in May 2013 as our benchmark. 

The performance of our method 

We compare our method with the previous state-of-the- 
art Cmap project to evaluate the performance [10]. 
We first prepare the ranked lists of over-expressed genes 
and down-expressed genes in both prostate and colorectal 
cancers which are conducted with affymetrix HG-U133A 
platform for Cmap project. The number of over-expressed 
and under-expressed gene signature in prostate cancer 
and colorectal cancer microarray data with different fold 
changes are shown in Table 1. Given the disease-specific 
gene signature as input query, the drug compounds 
ranked lists obtained from Cmap which are scored based 
on how well they are correlated with the input query. The 
score of 1 represents the input query perfect matches the 
changes among drug expression profiles in the database, 



Table 1 The number of gene signature with different fold changes in prostate cancer and colorectal cancer 



Cancer 






Prostate cancer 








Colorectal cancer 




Fold change 


1.0 


1.1 


1.2 


1.3 


1.4 


1.2 


1.4 1.6 1.8 


2 


# over-expressed genes 


89 


60 


45 


31 


27 


539 


308 175 106 


68 


# down-expressed genes 


232 


156 


115 


86 


58 


72 


42 27 12 


5 
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-1 represents perfect anti-correlation between them, and 
0 represents a null match. Both negative and positive 
enrichment belong to the drug-disease associations. 
Therefore, we take the absolute values of scores and 
re-rank them and higher score represents the stronger 
drug-disease associations. Due to Cmap project is derived 
from the expression profile of a drug compound to the 
isolated cell lines, multiple instances may correspond to 
same drug and even with same dose in the obtained 
results. We calculate the maximum score of multiple 
instances that correspond to the same drug as the asso- 
ciated score of a specific drug for dealing with such cases. 
In prostate and colorectal cancer data, we use 601 drug 
compounds which are overlapped between DrugBank and 
Cmap results and the drug rank list have shown in the 
previous works [37]. We compute observed the area under 
the receiver operator characteristic (ROC) curve (AUC) 
for analyzing the quality of performance in Figures 2 
and 3, respectively. The Cmap method obtains much 
lower AUC of 0.59 ± 0.04 and 0.59 ± 0.03 under the gene 
signature with different fold changes while our method 
obtains 0.94 and 0.89 in prostate cancer and colorectal 
cancer respectively. Previous study showed that only 
depending on the drug response expression profile is 
incapable to acquire the drug-disease associations accu- 
rately due to the profiles generated under different condi- 
tions [38]. There are several problems that limit the 
performance in Cmap project: First, the set of differentially 
expressed genes that constitute disease signatures or drug 



signatures were chosen empirically that cannot guarantee 
the biological relevance of the selected signatures. A bad 
selection of signatures may tend to capture similarities in 
the experimental settings rather than revealing the under- 
lying mechanisms. Second, only the overlap among genes 
between disease state and drug treatments are not quanti- 
fied as the overall effect of a drug and the values of differ- 
ential expression are also not taken into account in Cmap 
project. Therefore, the feasibility and benefits of using 
network-based analyses of chemical, genomic and pheno- 
type data provides a good chance to reveal drug-disease 
associations. 

The AUC with varying diffusion parameters 

We investigate the systematic effect of the different diffu- 
sion parameters among networks in our approach. We 
apply the diffusion parameters aj g{0.1, 0.3, 0.5, 0.7, 0.9} 
used in the experiments and 125 combinations of the 
parameters are tested. The ranking performances of our 
method with different combinations are measured 
by AUC values in Figure 4. The results show that the 
diffusion parameters set in drug, gene, and phenotype 
homo-network aj, ag_ and ttp as 0.1, 0.7 and 0.3 can get 
a higher AUC in both prostate and colorectal cancer. 
A higher similarity score between the chemical structures 
of a pair of drugs may sometimes have different targets to 
affect different functional modules in PIN. The drugs with 
similar chemical structures without binding to similar 
enzymes would reduce the predictive accuracy for drug 




Our Method (AUC=0.94) 

Cmap with fold change 1.0 (AUC=0.59) 

Cmap with fold change 1.1 (AUC=0.65) 

Cmapwith fold change 1.2 (AUC=0.53) 

Cmapwith fold change 1.3 (AUC=0.56) 

Cmapwith fold change 1.4 (AUC=0.63) 



1-specificity 



Figure 2 ROC curve among Cmap and our method in prostate cancer. 
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l-specificit\' 



Figure 3 ROC curve among Cmap and our method in colorectal cancer 



similarity [39]. On the other hand, the bit-comparison 
method of chemical structures using Tanimoto coefficient 
also has its limitations [40] : First, this kind of method does 
not include biochemical information at the atomic level of 
the representation. Second, sometimes it may yield low 
similarity values and it has an inherent bias related to the 
size of the molecules that are being sought. Furthermore, 
many physiological effects cannot be predicted accurately 



by chemical structure properties alone without more 
detailed information of metabolic and pharmacokinetic 
transformations of drugs [41]. Those may be the reason 
that why the diffusion parameter value in the drug 
network should be relatively smaller in our experiment. 
PIN tends to have much higher diffusion parameters to 
get higher AUC and it is reasonable to infer drug-disease 
associations not only by targeting the specific proteins 
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directly but also by modulating the pathways involved in 
the pathological process [42] . 

The performance of our method with different data source 

The drug is designed to target on the primary targets, but 
it also may interact with unexpected proteins (called off- 
targets) to trigger the associated biological processes and 
pathways. It may display undesired off-target toxicity or 
drug reposition for the disease phenotype. Therefore, we 
aim to add additional drug targets information to better 
understanding of dynamic processes under drug exposure. 
The database STITCH contains the relations between 
chemicals as chemical-chemical associations (CCA), and 
chemicals and their interacted proteins as drug-protein 
interactions (DPI) which are all curated by the evidence 
derived from experiments, publicly databases and 
the literature extraction [43]. However, the textual co- 
occurrence from text mining does not necessarily indicate 
meaningful relationships [44]. Therefore, we take 29,275 
chemical-chemical interactions and 45,567 chemical- 
protein pairs from STITCH excluded the relations 
extracted from the text mining method. In the rank lists of 
1,410 drugs, our method with the chemical structures 
similarity from CDK in drug homo-network and the 
chemical-protein interactions from STITCH in drug- 
protein hetero-network obtain highest AUC of 0.936 
and 0.861 in prostate cancer and colorectal cancer dataset 
in Figure 5 and 6, respectively. It shows that our method 
with more potential off-targets from STITCH gets higher 



AUC than that with only primary drug targets from 
DrugBank in both prostate cancer and colorectal cancer 
data. Due to only a limited number of chemical-chemical 
associations have been identified in STITCH [45], its 
performance is worse than the associations constructed by 
chemical structures similarity from CDK in drug homo- 
network. 

Case study: prostate cancer 

Potential drug and prostate cancer relations 

We run our method with diffusion parameters aj, ag_ and 
ttp as 0.1, 0.7 and 0.3 in drug, gene/protein, and phenotype 
homo-networks and display 26 drug-prostate cancer asso- 
ciations by evaluating the one-tailed test at 0.01 level of 
significance with Z-score > 2.33 in Table 2. Since the 
predictions are not regarded as true and we need to be 
further validated using external literature support, 
17 known associations are approved by the benchmark 
and other 7 associations are supported by the literature. 
The man with geriatric cancer may have prostate enlarge- 
ment or prostate cancer at a higher risk while taking fluox- 
ymesterone. On the other hand, if people has known or 
suspected prostate cancer, oxandrolone and drostanolone 
related to androgen receptors are not recommended to be 
taken in certain medical conditions. Previous clinical 
studies showed that prostate-specific antigen (PSA) plays 
an important tumor marker for prostate cancer in male 
and nandrolone phenpropionate is an androgen receptor 
agonist and previous study showed that it can decrease 




0.4 0.5 0.6 
1 -specificity 

Figure 5 ROC curve of our method with different data source supported in prostate cancer. 
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cell growth of prostate cancer LNCaP cells [46]. The 
primary action of cyproterone is to suppress the activity of 
the androgen hormones via competitive antagonism of 
the androgen receptor and inhibition of enzymes in the 
androgen biosynthesis pathway [47]. While the patients 
with metastatic prostate cancer were given mitomycin, 
the clinical results show good anti-tumour activity in 
metastatic prostate cancer and low toxicity in a phase II 
chemotherapy study [48]. Due to the androgens can 
stimulate the growth of prostate cancer cells, previous 
phase II trial study in 2012 demonstrated the hormone 



therapy using exemestane with or without bicalutamide 
may fight prostate cancer [49]. 
The significant functional modules related to the 
drug-prostate cancer association 

There are 31 genes with Z-score > 2.33 which are strongly 
related to 18 drugs and prostate cancer associations. We 
use the functional annotation analysis to investigate the 
functional enrichment of them via GSEA toolkits [50]. We 
select 4 annotation categories related to the functional 
pathways and processes including Biological Process 
(BP) in Gene Ontology (GO), KEGG, BIOCARTA, and 



Table 2 Drug-prostate cancer associations 



Drug ID 


Drug Name 


Score 


Drug ID 


Drug Name 


Score 


DB01420" 


Testosterone Propionate 


13.69 


DB01216" 


Finasteride 


4.85 


DB00621'' 


Oxandrolone 


12.76 


DB00367" 


Levonorgestrel 


4.46 


DB00984'' 


Nandrolone phenpropionate 


9.69 


DB00262" 


Carmustine 


416 


DB00858'' 


Drostanolone 


8.14 


DB06710" 


Methyltestosterone 


4.00 


DB04839'' 


Cyproterorne 


7.78 


DB00227" 


Lovastatin 


3.90 


DB00687'' 


Fludrocortisone 


7.44 


DB00783" 


Estradiol 


3.77 


DB00665" 


Nilutamide 


6.56 


DB01599 


Probucol 


3.77 


DBo lies'" 


Fluoxymesterone 


6.31 


DB00279 


Liothyronine 


3.22 


DB01128^ 


Bicalutamide 


6.20 


DB00421" 


Spironolactone 


3.01 


DB01395^ 


Drospirenone 


6.17 


DB01406" 


Danazol 


2.99 


DB00305'' 


Mitomycin 


5.24 


DB00624" 


Testosterone 


2.96 


DB00499" 


Flutamide 


5.07 


DB00396" 


Progesterone 


2.77 


DB00990'' 


Exemestane 


5.03 


DB00928" 


Azacitidine 


2.61 



a. Primary therapeutic function approved by our benchmarl<; b. supported by literature 
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REACTOME pathways to do the functional enrichment. 
The enriched functional biological processes and pathways 
of those genes with p-value < 0.05 are shown in Table 3 
and the detailed information is obtained from GSEA and 
DAVID gene functional classification tool [51] see Addi- 
tional file 1. The most significantly enriched terms of the 
regulation of apoptosis, cell cycle, p53 signaling and DNA 
repair indicate that those genes serve important roles 
in drug and prostate cancer association. The BRCA2 
mutation contributing to the young-onset prostate cancer 
has shown to be related to the Fanconi anemia pathway 
and DNA repair processes [52,53]. The prostate apoptosis 
response (Par) factor-related proapoptotic function is asso- 
ciated with prostate tumor progression and sheds light 
on the effects of the AR pathway on cell survival and 
apoptosis [54]. Interestingly, this functional module may 
be also for programmed cell death in response to 
induce apoptosis in neurodegenerative disorders such as 
Alzheimer's and Huntington's disease pathway [55]. 
With literature support, Par-4 has also been initially 
characterized in prostate cancer and also linked with 
the direct induction of apoptosis and even recognized to 
be a new target in pancreatic cancer [56]. 

Case study: colorectal cancer 

Potential drug and colorectal cancer relations 

We display 37 significant drug-colorectal cancer associa- 
tions with Z-score > 2.33 in Table 4. There are 28 known 
associations between drugs and colorectal cancer approved 
by our benchmark and 4 associations with literature 
support. Pyridoxal phosphate (PLP) is an important cofac- 
tor in the reactions of amino acid metabolism and 
previous studies indicated that increased blood PLP levels 
are associated with a reduced risk of colorectal cancer [57]. 
The previous findings showed the association between 
dietary fats and colorectal cancer, and a significant inverse 
association was found between colorectal cancer and 



Table 3 Functional modules related to the drug-prostate 
cancer associations 



Database 


Functional modules 


p-value 


REACTOME 


Fanconi anemia pathway 


4.65E-5 


KEGG 


Huntington's disease patiiway 


1.50E-4 


KEGG 


p53 signaling pathway 


9.72E-4 


BIOCARTA 


BRCA1, BRCA2 and ATR pathway 


1.61 E-3 


GO, KEGG 


Cell cycle 


1 .81 E-3 


REACTOME 


DNA Repair 


3,89E-3 


GO 


Negative regulation of cell proliferation 


9.75 E-3 


KEGG 


Alzheimer's disease pathway 


1.21 E-2 


GO 


Apoptosis 


1.26E-2 


KEGG 


Pancreatic cancer pathway 


1.69E-2 


KEGG 


Viral myocarditis 


1.83 E-2 


REACTOME 


Electron transport 


2.13E-2 



alpha-linoleic acid [58]. The results of this study suggest 
that substituting alpha-linoleic acid in the diet may reduce 
the risk of the colorectal cancer [58]. Previous works 
examined the effect of arsenic trioxide (AS2O3) at various 
concentrations on the cell growth of the colon cancer cell 
lines and showed that the growth of all cell lines was 
gradually suppressed with AS2O3 in comparison with that 
obtained without treatment [59,60]. Polyamines, organic 
compounds having two or more amino groups, have been 
shown to play an important role in the growth and survival 
in colorectal cancer [61]. We also find that spermine, 
a polyamine, may be a candidate target for therapeutic 
intervention in colorectal cancer [62]. The activity of the 
polyamine-synthesising enzyme, ornithine decarboxylase 
(ODC), is very highly expressed in proliferative HT-29 
colon cancer cells comparing to those from control 
samples, as well as in our case study [63]. L-ornithine is 
apparently efficiently utilized in the ODC pathway and 
is also considered as growth factors involved in cell proli- 
feration and differentiation regulated by amino acids meta- 
bolism [64]. The growing studies indicated the potential 
effectiveness of bortezomib in treatment of patients with 
HCT116 colon cancer by significantly increasing survivin 
expression [65,66]. The increasing evidences determined 
that the relationships among certain bacteria and cancer 
exist but the detail mechanism still unclear [67]. In our 
results, we interestingly find that an antibiotic drug zana- 
mivir for bacterial infection may cause or cure colorectal 
cancer [68], 

The significant functional modules related to the drug- 
colorectal cancer association 

The enriched functional biological processes of 44 
significant genes in drug-colorectal cancer associations by 
investigating the functional enrichment are shown in 
Table 5 and the detailed information see Additional file 2. 
The popular enriched functional terms of the gene mod- 
ules are regulation of Glycogen synthase kinase 3 (GSK-3) 
pathway, innate immune system, Wnt signalling pathway 
[69] and apoptosis pathway which all appear to be promis- 
ing biological processes associated with colorectal cancer. 
The activity of GSK3b expression in colorectal cancer 
patients is higher than those in their normal samples and 
the GSK3b inhibitor may induce apoptosis in human 
colorectal cancer cells [70,71]. Lipopolysaccharide (LPS) 
elicits several immediate proinflammatoy responses 
including CD14, Toll-like receptors, phosphatidylinositol- 
3'-kinase (PI3K)/Akt signaUng, myeloid differentiation fac- 
tor, nuclear factor kappa-light-chain-enhancer of activated 
B cells (NF-kB) transcription factors, and also promotes 
downstream bl integrin function in tumor growth and 
progression thereby increasing the adhesiveness and meta- 
static capacity of colorectal cancer cells [72-74]. Here, we 
present evidence that S100A8/A9 actives the downstream 
genes associated with Mitogen-activated protein kinases 
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Table 4 Drug-colorectal cancer associations 



Drug ID 


Drug Name 


Score 


Drug ID 


Drug Name 


Score 


DB0017^ 


Adsnosine triphosphat6 


12.35 


DB00313" 


Valproic Acid 


4.20 


0601593" 


Zinc 


8.66 


DB00162 


Vitamin A 


3.77 


DB00591 


Fluocinolons Acstonids 


8.25 


DB00131 


Adenosine monophosphate 


3.74 


DB00163" 


Vitamin E 


7.13 


DBOOI29'' 


L-Ornithine 


3.72 


DB01262'' 


Decitabine 


7.02 


DB00947" 


Fulvestrant 


3.46 


DB00435" 


Nitric Oxide 


6.70 


DB00741" 


Hydrocortisone 


3.34 


DB00783" 


Estradiol 


6.40 


DB01064" 


Isoproterenol 


3.21 


DB00328" 


Indomethacin 


6.35 


DB01 128" 


Bicalutamide 


3.1 7 


DB00755" 


Tretinoin 


6.26 


DB00773" 


Etoposide 


3.15 


DB00114'' 


Pyridoxal Phosphate 


5.91 


DB00481" 


Raloxifene 


3.10 


DB00544" 


Fluorouracil 


5.62 


DB00305" 


Mitomycin 


3.05 


DB00132'' 


Alpha-Linolenic Acid 


5.61 


DB00188'' 


Bort:ezomib 


2.77 


DB00396" 


Progesterone 


5.59 


DB01005" 


Hydroxyurea 


2.66 


DB00997" 


Doxorubicin 


5.52 


DB00619" 


Imatinib 


2.61 


DB001 79" 


Masoprocol 


5.09 


DB02546" 


Vorinostat 


2.53 


DB00558 


Zanamivir 


5.03 


DB00928" 


Azacitidine 


2.47 


DB01169" 


Arsenic trioxide 


4.38 


DB00498" 


Phenindione 


2.44 


DB00127" 


Spermine 


4.38 


DB00548 


Azelaic Acid 


2.39 


DB00834" 


Mifepristone 


4.21 









a. Primary tlierapeutic function approved by our benchmark; b. supported by literature 



(MAPK) and NF-kB signaling pathways to promote tumor 
growth and metastasis and the expression of S100A8/A9 
on myeloid cells is also essential for development of colon 
tumors in mice model [75]. Untersmayr detected Fc epsi- 
lon RI (FcERI)-positive epithelial cells in the colon cancer 
patients [76]. There has been a growing importance of the 



neurotrophin signalling in a variety of human cancers 
including colorectal cancer and malignant gliomas in 
particular [77]. Our study also denotes that the signifi- 
cances of p53, hypoxia inducible factor- 1 alpha pathway, 
and vascular endothelial growth factor (VEGF) expression 
in colorectal cancer [78]. 



Table 5 Functional modules related to the drug-colorectal cancer associations 



Database 


Functional modules 


p-value 


BIOCARTA 


GSK3 pathway 


7.29E-14 


KEGG 


Endometrial cancer 


2.30F-1 1 


KEGG 


Colorectal cancer 


4.32E-9 


KEGG, BIOCARTA, REACTOME 


Toll like receptor pathway 


5.79E-9 


KEGG 


Prostate cancer 


1 .36E-6 


BIOCARTA 


P53hypoxia pathway 


1.93E-6 


KEGG 


Lung cancer 


2.30E-6 


KEGG 


Glioma 


5.82E-6 


KEGG 


Renal cell carcinoma 


840E-6 


KEGG 


Melanoma 


9.01 E-6 


KEGG 


Chronic myeloid leukemia 


1 .03E-5 


KEGG 


VEGF signaling pathway 


1.26E-5 


KEGG 


Innate Immune System 


1.29E-5 


KEGG 


Fc epsilon RI signaling pathway 


1.52E-5 


REACTOME 


Regulation of ornithine decarboxylase (ODC) 


4.28E-5 


KEGG 


Pancreatic cancer 


1.74E-4 


KEGG 


Wnt signaling pathway 


3.34E-4 


GO 


Apoptosis 


1.54E-3 


KEGG 


Neurotrophin signaling pathway 


1.62E-3 
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Conclusions 

We integrate the information of drug, genomic and 
disease phenotype from available experimental data and 
knowledge as weighted networks and their connected 
relationships together. We apply disease-oriented net- 
work propagation approach for inferring and evaluating 
the likelihood of the probability between drugs and 
query disease. In our experiment, we adopt the prostate 
cancer and colorectal cancer as our case study and the 
results clearly outperform previous Cmap project. Our 
results are also found to be significantly enriched in 
both the biomedical literature and clinical trials. The 
success of our methods can be attributed as follows: 
First, we integrate heterogeneous data and knowledge 
about disease phenotype, chemical structure of drugs, 
and gene expression into our model. Second, our 
network propagation method combines the information 
not only from the single network but also derived the 
information from other connected homo-networks to 
infer the drug-disease association. Finally, our method 
with off-targets information gets higher performance 
than that with only primary drug targets in both test 
data. We believe that the combination of network and 
heterogeneous data source could help us to generate 
new hypotheses to infer the drug-disease associations 
and even speed up the drug development processes. 
Our study provides opportunities for future toxicoge- 
nomics and drug discovery applications but the limita- 
tion is the difficulty in distinguishing the positive and 
negative associations between drug and disease. In the 
future, we can choose different methods to calculate the 
chemical structural similarity between drugs, which could 
improve the limitations by using Tanimoto coefficient. 
On the other hand, our approach heavily relies on the 
weights for the edges in each network derived from the 
existing knowledge of drugs, targets, protein, disease or 
reported databases, or experimental results from the pub- 
lic database and the incompleteness of such information 
would limit our prediction power. We can also integrate 
various data sources such as drug response profile, 
side effect and pharmacological data and therapeutic/ 
toxicological expression profiles to verify the reliability 
and confidence of the interactions. 

Additional material 



Additional file 1 : The functional enrichment canonical pathways of 
the genes in prostate cancer We filter the functional enrichment 
canonical pathways of the overlap of the significant genes related to the 
drug-prostate cancer associations with at least 3 members in each 
functional category and p-value<0.05 using GSEA and DAVID online 
toolkit. 

Additional file 2: The functional enrichment canonical pathways of 
genes in colorectal cancer We filter the functional enrichment 
canonical pathways of the overlap of the significant genes related to the 



drug-colorectal cancer associations with at least 3 members in each 
functional category and p-value<0.05 using GSEA and DAVID online 
toolkit. 
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